Dynamic model data facility and automated operational model building and usage

ABSTRACT

A commercial process with a dependent variable can be associated with a set of independent variables. The commercial process can continuously provide data collection opportunities. An intervention is designed using a model to predict the dependent outcome. The actual outcome of the intervention can be determined within the window of utility for these data. One objective is to improve intervention outcomes with prediction. Purely random outcomes (no model prediction) and outcomes resulting from the intervention (model operations) are aggregated into separate files—a sequence of control model data files and a sequence of model data files of operational data. These model data files and control model data files are used to analyze model performance and to react automatically when identified conditions warrant.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a division of U.S. patent application Ser. No. 14/060,820, filed Oct. 23, 2013, which claims the benefit of U.S. Provisional Application No. 61/740,257, filed Dec. 20, 2012, the entire disclosures of which are hereby incorporated herein by reference.

FIELD OF THE DISCLOSURE

The present disclosure is generally directed toward analytics and specifically toward business-embedded analytics systems and methods.

BACKGROUND

Business today operates in a dynamic and uncertain global environment. The flood of scattered, often random data that pours in each day compounds the complexity of business decision making. Business intelligence (BI) collects and structures that data, and data warehousing stores it, but current BI solutions still fall short of successfully collecting and using the data and the business wisdom it contains to optimize business operations in a timely fashion.

Analytics provides the applications, techniques, and technologies for understanding performance drivers in an organization and for implementing fact-based, model-driven decision-making. Analytics focuses on extrapolating trends and making predictions and on enabling actionable insights that can drive decision-making.

An effective Operational Intelligence Platform is integrated throughout the enterprise to enable model-based decisions and connect business users at every level (strategic, tactical, and operational) with business data. To achieve its transformative potential, operational analytics must not be confined to power users but must be accessible to employees on the front lines who interact with customers and make operational decisions every day. In other words, operational analytics must be business-embedded.

While this potential is ignored by some 93 percent of businesses today, enterprises are recognizing that analytics will be the operational heart of a competitive and successful business in an interconnected world characterized by massive information access, a demanding customer base, scarcity of capital, shifting distribution channels, a mobile workforce, and intense cost competition.

SUMMARY

It is one aspect of the present disclosure to provide a complete Operational Intelligence Platform. In particular, it is one aspect of the present disclosure to provide a dynamic model data facility. It is another aspect of the present disclosure to provide techniques, algorithms, systems, and operational data management artifacts for building, using, and updating such dynamic models to be operative and accurate on a business real-time basis while increasing data mining and predictive analysis professionals' productivity by an order of magnitude.

Dynamic Model Data Facility

Predictive analytics and operational models rely heavily on data being syntactically and semantically organized or structured in a manner that facilitates its use by statistical applications for predictive modeling and usage. In general, all data related to an entity is organized into a single row or observation with values occurring in specific columns. In order for the analytics to be effective, unlike many other organized files, the data is processed and produced so that it is not sparse and typically gives a historical view of the entity between instances of the data. The values within the rows contain a mix of instantaneous and cumulative amounts or counts of data. While dates may be useful, the information the date confers tends to be computed and placed into the file along with values that represent counts and totals. For example, the number of months one has been a customer would be used instead of their signup date; the number of purchases made in the last 30, 60, and 90 days would be stored as three separate values; or the number of days since their last purchase would be recorded.
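The derived values named in the example above can be sketched as follows. This is a minimal illustration only; the function name `derive_row`, the column names, and the use of whole elapsed months are assumptions made for the sketch and are not taken from the disclosure.

```python
from datetime import date

def derive_row(signup: date, purchases: list[date], as_of: date) -> dict:
    """Build one MDF-style row of derived values for a single customer."""
    # Ignore any purchase dated after the reference date.
    past = [p for p in purchases if p <= as_of]

    # Whole months elapsed since signup, used instead of the signup date.
    months = (as_of.year - signup.year) * 12 + (as_of.month - signup.month)
    if as_of.day < signup.day:
        months -= 1
    row = {"months_as_customer": max(months, 0)}

    # Purchase counts over three trailing windows, kept as separate columns.
    for window in (30, 60, 90):
        row[f"purchases_last_{window}d"] = sum(
            1 for p in past if (as_of - p).days < window
        )

    # Recency as a count of days; None marks "no purchase yet" in this
    # sketch (a real MDF would need a non-sparse encoding for that case).
    row["days_since_last_purchase"] = (as_of - max(past)).days if past else None
    return row
```

Each call yields one observation whose columns are counts and durations rather than raw dates, which is the row shape the paragraph above describes.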

In accordance with at least some embodiments, a model data file (MDF) is provided, along with mechanisms for generating such MDFs on a periodic basis (e.g., daily, weekly, monthly, etc.) with updates that have been merged with the previous historical values.

In some embodiments, each file may be configured to provide a historical perspective of the entire set of entities and can be used to compare information across a timeline. The effort to construct these files requires incorporating and normalizing data from multiple sources in a complex process of Extract, Transform and Load (ETL) and entails not only joining values from different sources, but also aggregating values or counts across one or more periods (such as the purchases in the last N days example above). These ETL processes have historically been driven by a finite set of relatively static data sources and are usually flat file extracts. Because of the method of extraction, the transformations and loading tend to also be static and hence do not need to be flexible since the sources tend to remain static. This leads to the inability to detect and react quickly to new targets and predictors uncovered within the data.

The Dynamic Model Data Facility (DMDF), as disclosed herein, is designed to address a number of shortcomings that occur when using a traditional static MDF. The needs it addresses include flexible access to additional data sources and more timely access to and manipulation of that data. When coupled with an automated updating process, new relationships may be discovered on the fly, new data elements can be evaluated on the fly, and target markets can be uncovered in an automated or semi-automated fashion.

In accordance with at least some embodiments, there are two components to the implementation of the DMDF. The first component corresponds to a data abstraction and access layer and the second component corresponds to a model data definition facility, or analytics meta-data.

Data Abstraction and Access Layer

The data abstraction and access layer, in some embodiments, takes advantage of the recent advances in Big Data concepts. Primarily, Big Data was previously considered as exhaust data and too voluminous to capture and evaluate. Embodiments of the present disclosure, however, enable the Big Data to be quickly acquired, categorized and saved in a high performance manner. Taking advantage of parallelized storage and analysis software (and accelerated with specialized hardware such as Field Programmable Gate Arrays or FPGAs), data can now be stored in a raw form and evaluated on the fly in a Create/Read mode (as opposed to the traditional Create/Read/Update/Delete (CRUD) style of access). Storing data in its raw form is extremely useful since the algorithms that may aggregate, match or combine may be refined as time progresses, and saving only such transformed values may lose potential predictive value. For example, in a traditional approach, the database for a social networking application may use CRUD utilities to keep track of the current location of a member.

Using a CRUD approach in this way loses all information related to where a person was before and when they were there. Storing each location change along with when it occurred can open up a whole new set of analytics such as how often they move, as well as the impact it may have had on their relationship graphs. While traditional databases and structures can be used to contain such log-style information, performance is usually coupled with indexes, which can grow unbounded in size when large amounts of data are tracked, and performance can deteriorate quickly.
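The contrast between Create/Read storage and a CRUD-style current-value record can be sketched as below. The class `LocationLog` and its method names are hypothetical and chosen only for illustration; an in-memory list stands in for whatever parallelized store an embodiment would actually use.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class LocationLog:
    """Create/Read store: events are appended, never updated or deleted."""
    events: list = field(default_factory=list)

    def record(self, member: str, place: str, at: datetime) -> None:
        # Create: every location report becomes a new raw event.
        self.events.append((member, place, at))

    def history(self, member: str) -> list:
        """Every (place, time) pair for a member, oldest first."""
        rows = [(p, t) for m, p, t in self.events if m == member]
        return sorted(rows, key=lambda r: r[1])

    def current(self, member: str):
        """The only value a CRUD table would retain: the latest location."""
        rows = self.history(member)
        return rows[-1][0] if rows else None

    def move_count(self, member: str) -> int:
        """Number of location changes, which a CRUD row cannot provide."""
        places = [p for p, _ in self.history(member)]
        return sum(1 for a, b in zip(places, places[1:]) if a != b)
```

Because the raw events are retained, a derived value such as `move_count` can be defined or redefined later without having anticipated it when the data were captured.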

Quick access to huge amounts of tagged unstructured data across a wide timeline provides the key for a successful dynamic construction of data for predictive modeling. However, the data in its raw form does not lend itself to a structure that facilitates the use of predictive analytics modeling and data mining tools. In accordance with at least some embodiments, data for each entity can be combined in a manner such that the information for that entity is conveyed in a single row. Additionally, the values do not have to be computed in a uniform manner: they are not all sums or simple counts; rather, they can be summed and counted within a context such as new or within buckets or ranges.

Analytics Meta-Data

The second component of the DMDF is the analytics meta-data, or data about how the data for the analytics is to be constructed as well as information about the data itself (e.g., ranges, data types, etc.).

The analytics meta-data, in some embodiments, defines a number of key features which may include, but are not limited to, the following: the source of the data (which may be from extracts of traditional systems such as customer relationship management (CRM) databases as well as the collected raw Big Data); the transformations that must be performed on that data; the relationship of one set of data to another (such as indirection for joining sets of data); and selection criteria. In some embodiments, each meta-data definition identifies the process for constructing one or more values for a single row or observation. An MDF can be built on an ad-hoc basis using the most current data available at the time, addressing the second need of timeliness.
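One way to picture such a meta-data definition is as a small record naming the source, the join relationship, the selection criteria, and the transformation for one output column. The sketch below is an assumption-laden illustration: `ColumnDef`, `build_row`, and the dictionary-of-lists representation of sources are hypothetical and do not reflect any format stated in the disclosure.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ColumnDef:
    """Hypothetical meta-data entry describing one MDF column."""
    name: str                          # column to produce in the row
    source: str                        # which data set to read from
    join_key: str                      # field relating a record to the entity
    select: Callable[[dict], bool]     # selection criteria for records
    transform: Callable[[list], Any]   # aggregation over selected records

def build_row(entity_id: Any, defs: list, sources: dict) -> dict:
    """Construct a single observation for one entity from the definitions."""
    row = {"entity_id": entity_id}
    for d in defs:
        records = [
            r for r in sources[d.source]
            if r[d.join_key] == entity_id and d.select(r)
        ]
        row[d.name] = d.transform(records)
    return row
```

Because the definitions are data rather than fixed ETL code, a row can be rebuilt on demand against whatever the sources contain at that moment.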

Of course, the creation of an MDF may require that the meta-data be defined, and the meta-data is defined based on the definition of the predictive model, which itself uses an MDF for its construction. This conundrum is addressed by using business analysis to identify the potential dependent variables and by evaluating the potential independent variables based on how the business users evaluate their success using a Key Performance Indicator (KPI) model building system. Initially, this approach is used to build the first meta-data definitions, which construct an MDF for the first generation of a predictive model of the dependent variable. This can be an iterative process where multiple buckets or different timelines are adjusted to understand the impact they have on the model outcomes. However, once the model is developed and the independent variables are identified (perhaps along with the necessary transformations), the meta-data for construction of the MDF is defined and can be used to construct the file for processing on an ad hoc or as-needed basis.

This process of building an initial MDF from a variety of sources to evaluate against a dependent variable has a tremendous additional benefit. When used in a continuous and automated manner, new dependent variables may be considered to re-specify and build new models. When connected with other system components that are designed to monitor the flow and structure of data as it is processed by the data abstraction layer, potentially new and previously unconsidered relationships in data may be discovered that may provide new business value. Once the models have been validated and approved, the meta-data for these MDFs can be added to the catalog of available MDF builders.

Automated Model Building And Usage

Consider the following scenario: one binary dependent variable and many independent variables exist from some commercial process wherein the dependent variable represents success and the independent variables are uniquely associated with the respective dependent outcome. Prediction consists of assigning a probability for the binary dependent variable in terms of a complete set of values for all independent variables. In accordance with at least some embodiments, the probability can be calculated using “scoring code” produced from model building with observable values of the independent variables.
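As a rough illustration of what scoring code might look like, the sketch below assumes a logistic regression form, in which coefficients from model building are combined with one observation's independent values to yield a probability. The logistic form, the function name `score`, and the coefficient layout are assumptions for this sketch; the disclosure does not fix the model family at this point.

```python
import math

def score(intercept: float, betas: dict, row: dict) -> float:
    """Return P(dependent variable = 1) for one complete set of values."""
    # Linear predictor: intercept plus coefficient-weighted independents.
    z = intercept + sum(beta * row[name] for name, beta in betas.items())
    # Logistic link maps the linear predictor onto the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))
```

A row missing any named independent variable raises a `KeyError` here, which matches the requirement that prediction use a complete set of values.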

It is another aspect of the present disclosure to provide mechanisms for model building, model validation, ongoing model usage, and ongoing model validation, along with the methods of model retraining. It should be noted that retraining may comprise two distinct forms: re-calibration of the operative model (e.g., updating the model's coefficients with the latest data) and re-specification of the model itself (e.g., creating an entirely new model by re-running feature creation and feature selection over the latest data). The former is faster and cheaper but potentially less effective in the event of non-stationary independent variables.

The DMDF as disclosed herein can be continuously updated to create variables for consideration by predictive modeling systems applied to ongoing commerce, with continuous testing for stationarity and stability. Responsive interactions between the two expert systems may require data that are timely, accurate and of sufficient size for automated operational modeling to succeed.

In accordance with at least some embodiments, a model built with initial data can be reiteratively retrained when significant deviations from the original training and/or validation MDF are measured in the ongoing operations. Both the initial training and validation data (MDFs) can be of the same vintage, existing in one system before the model is operationalized in the training and validation system. In some embodiments, the data used to estimate ongoing deviations are collected during operations and are independent of the initial training and validation data.

This implies that the training and validation system, with its model prediction mechanism embedded, has the capability to capture actual user outcomes continuously. The deviations governing retraining decisions can be decided by comparing ongoing model operation to the original training and validation data. Four distinct data sets (e.g., in the form and function of MDFs) can be used to this end: the original training data; the original validation data; the ongoing operational data with outcomes; and an ongoing set of data not used in the operational model (e.g., a random set that is withheld for control purposes).

In some embodiments, the input to model building is the initial MDF with its columns of dependent and independent variables and rows of observations associated with a customer or business-related instance. One dependent variable means one model. The interface out of model building creates scoring code capable of calculating a probability of “success” for the dependent variable given a specific set of independent values. These calculated probabilities may be referred to as “predictors.”

In some embodiments, the primary output of model building is the scoring code to be used for predictions. The secondary output is the data used for model validation.

Assumptions and Effects

The assumption of constancy (stationarity) is made for the initial training data over its prediction period to the extent that the specified model's predictions are expected to accurately reflect the training scenario. Anything that invalidates the assumption of constancy threatens to make predictions invalid. Multiple ways of judging the assumption are used during model operations.

The initial data constitute the initial comparator against which both improvement and constancy can be measured.

There is an observer effect in all data collection mechanisms. In some embodiments, the observer effect is maintained as constant as possible across differing data collection mechanisms.

The act of embedding a model in a business process also introduces a modeling effect that results from the interference of the model in the ongoing commercial process. This modeling effect means a constant source of random inputs divided from the total commercial input stream is needed for control purposes.

Endogenous Effects

The goal of model building is to predict a 0|1 dependent variable based on a set of values for the model's independent variables. This makes the dependent variable endogenous to the management process. The assumption of constancy means the average composition of the values of the independent variables should be constant, along with their relationship to the dependent variable. Changes in the average composition of the independent variables will have potentially large positive or negative effects on the endogenous variable that may indicate lost constancy. The composition of the independent variables is exogenous to the model.

Prediction Process

The prediction process involves assimilating a coherent set of independent values and calculating its probability in an increment of time where the prediction has business value. After the increment of time, the actual outcome associated with the coherent set of independent values is known and recorded.

After operating for an arbitrary time interval, there exists some number of actual observations of the dependent variable associated with the predicted probability for each actual set of independents. This is not how the initial MDF was created. The initial MDF was created from some collection mechanism separate from the expert system. A model was built from the initial MDF, and then implemented for automated operational modeling.

The differences in collection mechanisms should be consciously managed and rationalized to maintain model efficacy.

Operational Scaling

In some embodiments, the expert system in which the model is built and used may have different operational characteristics from the training and validation system. In some embodiments, the training and validation system may be configured to run in real-time or near-real-time so predictions can be offered in time to change behavior. The expert system, on the other hand, need not be so responsive and can conceivably run from a queue with service levels in the vicinity of 2 to 6 hour turnaround times.

The expert system may be configured with large memory loading and may require interpretation so scalability is challenging. In some embodiments, one model building run per system image can be assumed for the expert system. This means horizontal scaling by virtual systems. Each system will have unique data, in general.

MDF Size and Sequence

As previously mentioned, ongoing outcomes, both random, un-modeled holdout data and modeled data, are used for ongoing validation. The sequence of MDF(i) should be timed according to size minimums and seasonal factors. The lag in determining the actual outcome doesn't add significant latency to preparing the MDF.

Validation data does not have to be subject to the same size limits as modeling data as significant deviations can be detected in model efficacy before enough data to retrain a new model can be recorded and calculated.

Handling Time Series

In general, the data flowing through a business are accurately considered time series. Since the expert system is based on correlations in aggregated data, time series are not necessarily handled well.

Time series are characterized by waves (seasonality) and trends on top of a stable mean and variance for some random variable(s). Statistical regression can deal with linearized trends. The waves of seasonality cause the issues. The adjusted approach is to look at seasonality by its natural units (e.g., by weekday), and then train on weekday/weekday data constructions, if enough data exist. For example, the model is trained for next Monday on last Monday's data or the last four Mondays' data (seasonal weeks).
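The weekday/weekday construction can be sketched as a selection of training rows by date. The function names and the assumption that each row carries a `day` field holding a `datetime.date` are hypothetical and used only for illustration.

```python
from datetime import date, timedelta

def seasonal_training_days(target: date, weeks: int = 4) -> list[date]:
    """Dates of the same weekday as target over the preceding weeks."""
    return [target - timedelta(weeks=k) for k in range(1, weeks + 1)]

def select_training_rows(rows: list, target: date, weeks: int = 4) -> list:
    """Keep only rows observed on the matching prior weekdays."""
    days = set(seasonal_training_days(target, weeks))
    return [r for r in rows if r["day"] in days]
```

Setting `weeks=1` corresponds to training next Monday's model on last Monday's data alone, while `weeks=4` uses the last four Mondays, subject to enough data existing.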

The modeling (beta) coefficients provide good information on the strength and directionality of independent variables. Tracking trends in the values of independent variables along with the effect in direction on the dependent variable will help manage the assumption of constancy but can be manual in nature.

Measuring Fitness

Assuming a model is well calibrated with a minimum but meaningful number of significant coefficients, the model also needs discrimination and stability to predict outcomes effectively. This is known as the model's goodness-of-fit (GOF). In some embodiments, the expert system disclosed herein is superior in finding models with GOF. Specifically, the expert system may employ a gains chart to depict GOF plus several other techniques to ensure accurate calibration.

The discrimination of the scoring code's prediction is well measured by a gains chart conveying how the predicted probability separates 1's from 0's in the dependent column for the available data. Model stability can be estimated by comparing the gains of holdout data to the gains of the training data.
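A cumulative gains table of the kind described above can be computed as in the following sketch. The function names, the default of ten bins, and the use of the largest gap between two gains curves as a stability estimate are assumptions made for illustration, not the disclosure's stated procedure.

```python
def cumulative_gains(probs: list, outcomes: list, bins: int = 10) -> list:
    """Fraction of all 1's captured within the top k/bins of scored rows."""
    # Rank observations from highest to lowest predicted probability.
    ranked = [y for _, y in sorted(zip(probs, outcomes), key=lambda t: -t[0])]
    total = sum(ranked)
    n = len(ranked)
    gains = []
    for k in range(1, bins + 1):
        cutoff = (k * n) // bins
        gains.append(sum(ranked[:cutoff]) / total if total else 0.0)
    return gains

def gains_gap(gains_a: list, gains_b: list) -> float:
    """Largest absolute difference between two gains curves."""
    return max(abs(a - b) for a, b in zip(gains_a, gains_b))
```

A model that separates 1's from 0's well captures most of the 1's in the first few bins; comparing the training curve with the holdout curve through `gains_gap` gives one simple number for stability.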

But ongoing measurement and estimation of GOF can benefit from slightly different techniques. Since operational modeling moves away from model training as the time semantics of the data change, the stepwise algorithms and criteria used for tight calibration can become unreliable. An automated, repeatable and ongoing test should be executed against the latest data.

A model can have high calibration by fitting the available data extremely well yet exhibit poor discrimination, or have good discrimination but poor calibration. For example, consider a repeatable but random process that can be accurately modeled as purely random. The result is good calibration but poor discrimination. On the other hand, consider adding a constant to the output probability of a good model. The calibration would be off by the amount of the constant but the collating sequence of probabilities is unchanged so discrimination would be unchanged at any cut-point.
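The constant-shift example can be checked numerically with a small sketch. The helper names, the toy data, and the use of mean predicted probability minus observed success rate as a calibration measure are assumptions chosen for illustration.

```python
def rank_order(probs: list) -> list:
    """Indices of observations from highest to lowest probability."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

def calibration_gap(probs: list, outcomes: list) -> float:
    """Mean predicted probability minus observed success rate."""
    return sum(probs) / len(probs) - sum(outcomes) / len(outcomes)

# Toy scored data and the same scores with a constant added.
probs = [0.10, 0.20, 0.60, 0.70]
outcomes = [0, 0, 1, 1]
shift = 0.15
shifted = [p + shift for p in probs]

same_ranking = rank_order(probs) == rank_order(shifted)
gap_change = calibration_gap(shifted, outcomes) - calibration_gap(probs, outcomes)
```

In this sketch `same_ranking` holds because adding a constant cannot reorder the scores, while `gap_change` equals the added constant, which is the separation of discrimination from calibration that the paragraph describes.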

To effectively challenge the hypothesis that the model fits the data, the training and validation system measurements can capture changes in the independent variables and their relation to the dependent variable, the endogenous phenomenon being modeled, sufficient to conclude something has changed. Since the operational data that are being modeled contain a “modeler” effect, a source of unaffected data for control should be collected and evaluated continuously. Measurements and estimates can be drawn from the ongoing control data and then compared to estimates from preexisting validation data. Other comparisons to training data and operationally modeled data are also available.

The basic test approach assumes that a model that fits the data is a good model and requires estimable deviations from key statistics to reject that assumption, called the null hypothesis. Instead of relying on a fixed level of significance, one or more statistical significance testing values may be used in determining a significance level. In one embodiment, the p-value associated with the observed measurements and assumptions is reported for action. Test statistics are estimated to cover calibration and discrimination. A user-implemented decision criterion is then used to make the ultimate retraining decision.

In general, these test statistics approximate either normal (for proportions testing) or chi-square (for calibration testing) distributions. Once a test statistic is calculated for the appropriate situation, it will be converted to a p-value for a two-sided test. A one-sided test is also a possibility. The p-values can be reported within their respective contexts to the user for a decision. In the case of testing for discrimination, several p-values will be available, as described below, and then input into user-controlled decision criteria.
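For the proportions case, the conversion from a test statistic to a p-value can be sketched with the standard library alone. The pooled two-proportion z statistic used here is a common textbook choice and is an assumption of this sketch, as are the function names; the chi-square case for calibration would need a chi-square tail probability, typically taken from a statistics library, and is not shown.

```python
import math

def two_proportion_z(hits_a: int, n_a: int, hits_b: int, n_b: int) -> float:
    """Pooled z statistic for the difference between two proportions."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    # Standard error under the null hypothesis of equal proportions;
    # this is zero (and the division fails) if pooled is exactly 0 or 1.
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

def p_value_two_sided(z: float) -> float:
    """P(|Z| >= |z|) for a standard normal Z."""
    return math.erfc(abs(z) / math.sqrt(2))

def p_value_one_sided(z: float) -> float:
    """P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))
```

For instance, the two samples could be the true positive counts at a given cut-point in the holdout data and in the ongoing control data, with the resulting p-value reported to the user rather than compared to a fixed significance level.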

To track ongoing GOF, statistics on both calibration and discrimination for current data will be compared to the pre-existing validation (holdout) data from model building. Maintaining a chosen model is the null hypothesis, whether for calibration or discrimination. Every test of significance has a stronger conclusion with more data and a lower p-value.

Model Retraining

There are two possibilities (other than doing nothing) in responding to decisions taken for perceived changes in model assumptions: (1) re-calibrate the coefficients for the existing model or (2) re-specify a new model with potentially new features.

One possibility is to re-calibrate at regular intervals using a moving data horizon, such as adding a new day and dropping an old day from the training data on a daily basis, then recalibrating the coefficients for the incumbent model. The p-values can be used to decide when to re-specify an entirely new model. In short, re-calibrate daily and re-specify as indicated by the above statistics.
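The moving data horizon can be sketched with a bounded queue of daily batches. The class name `MovingHorizon` is hypothetical, and the step that actually refits coefficients is left to whatever fitting routine an embodiment uses; only the add-a-day, drop-a-day bookkeeping is shown.

```python
from collections import deque

class MovingHorizon:
    """Keeps the most recent N daily batches of training rows."""

    def __init__(self, days: int):
        # A deque with maxlen discards the oldest batch automatically.
        self.window = deque(maxlen=days)

    def add_day(self, rows: list) -> list:
        """Add today's batch and return the current training set."""
        self.window.append(rows)
        return [r for day in self.window for r in day]
```

Each daily call returns the rows on which the incumbent model's coefficients would be re-calibrated, with the model specification itself left unchanged.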

A possible issue with this approach follows from the fact that far less data are needed to make a re-training decision than are needed to completely re-specify a new model. In any case, there is a strong business incentive against capturing a large fraction of random, un-modeled data. A paradox is possible where the need to re-specify is established but insufficient data for re-specification exists. Regular re-calibration could proceed until enough “new” independent data exist for re-specification. Or, operational data could be used for re-specification.

The above process will present the user with a constellation of p-values, one for calibration and several from differing cut-points for True Positive Rate (TPR) and False Positive Rate (FPR). Since the decision to retrain is binary, these p-values are converted into a yes or no decision. This is accomplished by scanning multiple rules and re-specifying the model if any rule is satisfied (or not). For example, the Chi Square statistic in a rule for calibration plus the TPR proportional differences between initial holdout scored data and ongoing random scored data at the first, third and fifth deciles, respectively, produce p-values for three rules. Each rule can have different thresholds for satisfaction.
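The rule scan can be sketched as a comparison of each reported p-value against that rule's own threshold, with re-specification indicated if any rule fires. The rule names, the threshold values, and the "p-value below threshold" form of a rule are hypothetical; the disclosure leaves the decision criteria to the user.

```python
def fired_rules(p_values: dict, thresholds: dict) -> list:
    """Names of rules whose p-value falls below that rule's threshold."""
    return [
        name for name, limit in thresholds.items()
        if name in p_values and p_values[name] < limit
    ]

def should_respecify(p_values: dict, thresholds: dict) -> bool:
    """Binary decision: re-specify if any single rule is satisfied."""
    return bool(fired_rules(p_values, thresholds))
```

Keeping a separate threshold per rule lets a user treat, say, a calibration deviation more strictly than a deviation at one decile cut-point, and `fired_rules` records which evidence drove the decision.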

As can be seen above, a commercial process with an important 0|1 dependent variable can be associated with a set of independent variables. The commercial process continuously provides data collection opportunities. An intervention is designed using a model to predict the dependent outcome. The actual outcome of the intervention can be determined within the window of utility for these data.

One objective is to improve intervention outcomes with prediction. Purely random outcomes (no model prediction) and outcomes resulting from the intervention (model operations) are aggregated into separate files, a sequence of control MDFs and a sequence of MDFs of operational data. These MDFs are used to analyze model performance and to react automatically when identified conditions warrant.

Continuous dependent variables potentially need other treatment, as do survival, clustering/CF and tree classification situations.

The present disclosure will be further understood from the drawings and the following detailed description. Although this description sets forth specific details, it is understood that certain embodiments of the invention may be practiced without these specific details. It is also understood that in some instances, well-known circuits, components and techniques have not been shown in detail in order to avoid obscuring the understanding of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

The present disclosure is described in conjunction with the appended figures:

FIG. 1 is a system diagram depicting interactions between an expert system and a testing and validation system in accordance with embodiments of the present disclosure;

FIG. 2 is a detailed system diagram depicting an expert system in accordance with embodiments of the present disclosure;

FIG. 3 is a block diagram of customer data instances in communication with an expert system in accordance with embodiments of the present disclosure;

FIG. 4 is a block diagram of nodes associated with a customer data instance in accordance with embodiments of the present disclosure;

FIG. 5 is a flow diagram depicting a method of automatically testing and validating an operational model in accordance with embodiments of the present disclosure;

FIG. 6 is a diagram depicting an analytics platform system in accordance with embodiments of the present disclosure;

FIG. 7 is a services architecture in accordance with embodiments of the present disclosure; and

FIG. 8 is an exemplary data structure row in accordance with embodiments of the present disclosure.

DETAILED DESCRIPTION

The ensuing description provides embodiments only, and is not intended to limit the scope, applicability, or configuration of the claims. Rather, the ensuing description will provide those skilled in the art with an enabling description for implementing the described embodiments. It is understood that various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the appended claims.

Referring initially to FIG. 1, a system 100 is depicted as including a first expert system 104 and a second expert system 108. The system 100, in accordance with at least some embodiments, may correspond to a business analytics system and may be deployed on one or more local enterprise servers, via a web-based architecture (e.g., as Software as a Service (SaaS), as a cloud-based service, etc.), via cluster computing, or via any other known architecture. In other words, the expert systems 104, 108 may correspond to systems comprising one or more servers, computers with processors and memory, virtual machines, FPGAs, ASICs, or combinations thereof.

One or both of the expert systems 104, 108 may correspond to any computer or computer system that emulates the decision-making ability of a human expert. Expert systems are designed to solve complex problems by reasoning about knowledge, like an expert, and not by following a procedure of a developer. Examples of expert systems include, without limitation, GainSmarts® data mining system, SAS® Enterprise Miner™, IBM's SPSS Clementine®, and the like. Any one or combination of these systems or any other type of known or yet to be developed expert system can be used in accordance with embodiments of the present disclosure.

In some embodiments, the first expert system 104 is loosely coupled to the second expert system 108. As an example, the first expert system 104 may correspond to a system in which models are built and deployed while the second expert system 108 may correspond to a system in which models are made operational and are continually tested and/or verified. As a more specific example, the second expert system 108 may correspond to a system that applies predictive models to ongoing commerce that is occurring at the first expert system 104. The second expert system 108 may be configured to continuously test the models being deployed in the first expert system 104 for stationarity and/or stability. In some embodiments, responsive interactions between the expert systems 104, 108 depend upon data exchanges that are timely, accurate, and of sufficient size for automated operational modeling to succeed. Specifically, the first expert system 104 may be configured to generate and provide scoring code (J) to the second expert system 108 while the second expert system 108 may be used to manage GOF for the model(s) built in the first expert system 104.

The second expert system 108 may be structured such that a model built with initial data in the first expert system 104 is reiteratively retrained when significant deviations from the original training and/or validation model are measured in ongoing operations. Accordingly, the second expert system 108 may be provided with the capability to capture actual user outcomes continuously from the first expert system 104. Details of the interactions between the expert systems 104, 108 are depicted in FIG. 2.

FIGS. 2-8 depict further details of an analytics system, an analytics platform, and components thereof. It should be appreciated that some or all of the elements depicted in FIGS. 2-8 may be used to implement some or all of the disclosed functions and features.

FIG. 2 shows a detailed system diagram depicting an expert system 200 in accordance with embodiments of the present disclosure. As shown, the expert system 200 may join one or more customers 208 with the system 200 over a communication network 204. In accordance with at least some embodiments of the present disclosure, the communication network 204 may comprise any type of known communication medium or collection of communication media and may use any type of protocols to transport messages between endpoints. The communication network 204 may include wired and/or wireless communication technologies. The Internet is an example of the communication network that constitutes an Internet Protocol (IP) network consisting of many computers, computing networks, and other communication devices located all over the world, which are connected through many telephone systems and other means. Other examples of the communication network 204 include, without limitation, a standard Plain Old Telephone System (POTS), an Integrated Services Digital Network (ISDN), the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a cellular network, and any other type of packet-switched or circuit-switched network known in the art. In addition, it can be appreciated that the communication network 204 need not be limited to any one network type, and instead may be comprised of a number of different networks and/or network types. Although several examples provided herein may depict a hypertext transfer protocol (HTTP) application level interface, other forms of application level interface may be used. For instance, technologies such as messaging, queues, and/or peer-to-peer may equally apply, to name a few.

In some embodiments, the core functionality of the expert system 200 may be used across one or more customers 208. Additionally or alternatively, the system 200 can track and score visitor traffic. The system 200 may categorize each uniform resource locator (URL) and capture hits/category, time/category, and combinations thereof. In one embodiment, the system 200 may track and score category position in a data stream. This data stream can include, but is not limited to, text, video, file, multimedia, combinations thereof, and the like. As can be appreciated, the system 200 is extensible and may determine signaling with any external system.

An enterprise service bus 212, in communication with one or more customers 208 via the communication network 204, may include message processors 216 and a message archiver 220. In some embodiments, the message processor 216 may be used to control how messages are sent and received within a message flow. Typically, the message processor 216 is configured to handle messages by applying business logic to a given message and routing the message to an appropriate application. In some cases, a message archiver 220 may be employed to optimize server load by moving messages from the server to the message archiver 220. The efficiency of servers, especially in Big Data applications, is paramount to optimal system performance. As such, the message archiver 220 may be used to move traffic from a server to the message archiver 220. In some cases, the message archiver 220 may scan for duplicate data and even act to remove redundant data from the system. The message archiver 220 may store the data in an archived data 224 storage location. This archived data 224 may be compressed prior to storage.
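
As a non-limiting illustration, the duplicate scanning and compression described for the message archiver 220 might be sketched in Python as follows. The class name MessageArchiver, the helper archive_all, and the choice of SHA-256 digests with zlib compression are assumptions made for this sketch; the disclosure does not specify a hashing or compression scheme.

```python
import hashlib
import zlib


class MessageArchiver:
    """Hypothetical archiver: drops duplicate payloads, stores the rest compressed."""

    def __init__(self):
        # Maps the SHA-256 digest of a payload to its compressed bytes
        # (stands in for the archived data storage location).
        self.archived = {}

    def archive(self, payload: bytes) -> bool:
        """Store the payload; return False if an identical payload is already archived."""
        digest = hashlib.sha256(payload).hexdigest()
        if digest in self.archived:
            return False  # redundant data is not stored twice
        self.archived[digest] = zlib.compress(payload)  # compress prior to storage
        return True

    def restore(self, payload_digest: str) -> bytes:
        """Return the original payload for a previously archived digest."""
        return zlib.decompress(self.archived[payload_digest])


def archive_all(archiver, messages):
    """Move every message off the server-side list; return how many were new."""
    stored = sum(1 for message in messages if archiver.archive(message))
    messages.clear()  # the server no longer holds the messages
    return stored
```

In this sketch, server load is reduced simply by emptying the server-side list once the messages have been handed to the archiver; a production system would likely archive asynchronously.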

In some embodiments, an MDF may be generated on a periodic basis or on the fly. As an MDF is used in daily/periodic processing 248, the file can be updated to account for changes detected in the system. For example, as data is being scored in real-time 244, the MDF may be caused to recalibrate and/or rebuild based on how close the score is to predicted values of constancy. In other words, the automated scoring used by the system 200 may allow for the calibration, validation, and even updating of the MDF. Via the cache admin 232 the system 200 may determine one or more instances of matching content through use of the clustered cache 236. By way of example, the clustered cache 236 may be used to store results and/or scoring associated with the MDF.

It is an aspect of the present disclosure that load testing tools 240 may be in communication with one or more components of the system 200. Among other things, the load testing tools 240 may submit the system 200 to traffic and other performance tests, including but not limited to simulated workloads. In some embodiments, the load testing tools 240 may be employed to test the software and/or system performance and can even be configured to measure response times under simulated traffic and/or loads.

Referring now to FIG. 3, a block diagram of customer data sources in communication with an expert system is shown in accordance with embodiments of the present disclosure. As shown, customer instances 312A-N may be grouped together to run as a cluster 308. In some embodiments, customer instances 312A-N may include, but are in no way limited to, message processors 316A-N, message archivers 320A-N, and message data 324A-N. The function of each of these components may be similar, or identical to, the description previously provided for message processors, archivers, and archived data in reference to FIG. 2. It is an aspect of the present disclosure that the cluster 308 may include one or more customer instances 312A-N. One example of utilizing the clustered customer instances 312A-N can include the benefit of allowing an application to run on parallel servers.

FIG. 4 depicts a block diagram of nodes associated with a customer data instance in accordance with embodiments of the present disclosure. One or more nodes 412A-N are shown in a cluster 406. Utilizing clustering can allow applications to run on parallel servers, and as such, the load may be distributed across different servers. In addition to offering capability redundancy in the event that a single server of the parallel servers fails, performance can be improved by adding more nodes to each cluster 406.

In some embodiments, each node 412A-N may include a message processor 416A-N, a software framework client 420A-N, and a software framework client distributed file system, or file system, 424A-N. The message processor 416A-N may be used to control how messages are sent and received within a message flow. As such, each message processor 416A-N can be configured to handle messages by applying business logic to a given message and routing the message to an appropriate application. The software framework client 420A-N may be used to support running applications on large clusters of hardware. Additionally or alternatively, the software framework client 420A-N may be configured to support data-intensive distributed applications. In some embodiments, the file system 424A-N may include, but is not limited to, a distributed, scalable, and portable file system written for the software framework client 420A-N.

In one embodiment, a load balancer 408 may be employed to process all requests and dispatch them to server nodes 412A-N in the cluster 406. As can be appreciated, the load balancer 408 may be used to handle requests sent via a web browser client. The client web browser may use policies associated with the load balancer 408 to determine which node will receive a particular request from the web browser.
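
By way of a hedged example, one possible dispatch policy for the load balancer 408 is round-robin selection. The disclosure does not name a policy, so the round-robin choice, the class name RoundRobinBalancer, and the string node labels below are all assumptions of this Python sketch.

```python
import itertools


class RoundRobinBalancer:
    """Hypothetical balancer that hands each request to the next node in turn."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("at least one node is required")
        # cycle() repeats the node list endlessly in its original order.
        self._cycle = itertools.cycle(list(nodes))

    def dispatch(self, request):
        """Return the (node, request) pairing chosen for this request."""
        node = next(self._cycle)
        return node, request
```

Other policies (least connections, hashing on a session identifier, and so on) would fit the same dispatch interface.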

FIG. 5 depicts one example of a model retraining process in accordance with at least some embodiments of the present disclosure. The method begins by checking the data of an MDF for cleanliness and structure (step 504). This particular step may be performed in the first expert system 104, the second expert system 108, or combinations thereof.

A sample of the data may then be randomly obtained and partitioned into at least two portions (step 508). In some embodiments, the random sample of data may be randomly divided into one-third (⅓) and two-thirds (⅔) partitions.

Thereafter, the first expert system 104 (or one of several other regression systems) may be used to build a logit or other binary (0|1) model from the two-thirds partition (step 512) and validate it on the one-third partition (step 516).
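
Steps 508 through 516 can be illustrated with a minimal Python sketch. It assumes a plain gradient-descent logistic regression rather than any particular expert system, and the function names (split_one_third, fit_logit, predict, accuracy) and the learning-rate and epoch settings are hypothetical choices for the example.

```python
import math
import random


def split_one_third(rows, seed=0):
    """Shuffle rows and return (validation third, training two-thirds)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) // 3
    return shuffled[:cut], shuffled[cut:]


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w, x):
    """Predicted probability that the 0|1 outcome is 1; w[0] is the intercept."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def fit_logit(rows, epochs=200, lr=0.1):
    """Fit logit weights by batch gradient descent; rows are (features, label)."""
    w = [0.0] * (len(rows[0][0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in rows:
            err = predict(w, x) - y  # gradient of the log loss w.r.t. the logit
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / len(rows) for wj, gj in zip(w, grad)]
    return w


def accuracy(w, rows):
    """Share of rows whose 0.5-thresholded prediction matches the label."""
    hits = sum(1 for x, y in rows if (predict(w, x) >= 0.5) == (y == 1))
    return hits / len(rows)
```

A typical use would call split_one_third on the cleaned sample, pass the larger partition to fit_logit, and report accuracy (or another validation statistic) on the smaller partition.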

The model may then be implemented in a scenario with a congruent data flow in such a way that commerce can be influenced by a prediction on a 0|1 dependency in the data (step 520). During implementation of the model in the first expert system 104, a small random fraction of the ongoing data may be held out such that the held ongoing data can be provided to the second expert system 108 (step 524).
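
The hold-out of step 524 might look like the following Python sketch. The function name route_ongoing_data and the 5% default fraction are assumptions; the disclosure says only that the fraction is small and random.

```python
import random


def route_ongoing_data(records, holdout_fraction=0.05, seed=None):
    """Split ongoing records into (modeled, held_out) by independent random draws.

    Held-out records bypass the model so that they can serve as control data.
    """
    rng = random.Random(seed)
    modeled, held_out = [], []
    for record in records:
        if rng.random() < holdout_fraction:
            held_out.append(record)  # destined for the second expert system
        else:
            modeled.append(record)   # handled by the deployed model
    return modeled, held_out
```

Drawing independently per record keeps the routing stateless, so it can be applied to a stream without knowing its length in advance.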

As the model continues to be implemented, actual outcomes may be periodically gathered and integrated into MDFs (modeled and random) such that the actual outcomes can be used for scoring and evaluation (step 528).

The second expert system 108 may then consider regular recalibration (without any tests) using a moving MDF (step 532). Test statistics may then be calculated for calibration and discrimination (step 536). As a specific but non-limiting example, ten pairs of proportions may be calculated at each decile for both TPR and FPR and these pairs of proportions may be used for discrimination.
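
One reading of the decile example, offered here as an assumption rather than as the definitive computation, is that records are ranked by model score and a (TPR, FPR) pair is computed for the top 10%, top 20%, and so on through 100% of the ranked records. The function name decile_tpr_fpr is hypothetical.

```python
def decile_tpr_fpr(scores, outcomes):
    """Return ten (TPR, FPR) pairs, one per cumulative decile of ranked scores.

    scores are model predictions; outcomes are the actual 0|1 results.
    """
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    positives = sum(outcomes)
    negatives = len(outcomes) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative outcome")
    pairs = []
    for decile in range(1, 11):
        cutoff = round(len(ranked) * decile / 10)
        flagged = ranked[:cutoff]          # records treated as predicted positives
        tp = sum(y for _, y in flagged)    # true positives among the flagged
        fp = len(flagged) - tp             # false positives among the flagged
        pairs.append((tp / positives, fp / negatives))
    return pairs
```

Comparing the ten pairs from operational data against the ten pairs from the validation partition gives a decile-by-decile view of whether discrimination has drifted.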

Based on the calculated test statistics, a multi-rule decision criterion may be used to determine whether or not re-specifying is indicated (step 540).
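
The rules themselves are not enumerated above, so the following Python sketch is purely illustrative. The function decide_action, its three outcomes, and every threshold are assumptions: it treats widespread decile-level discrimination drift as grounds for re-specifying, and a calibration gap alone as grounds for recalibrating.

```python
def decide_action(calibration_gap, decile_gaps,
                  calibration_tol=0.05, decile_tol=0.10, max_bad_deciles=2):
    """Return "re-specify", "recalibrate", or "keep" from hypothetical rules.

    calibration_gap: operational minus validation calibration statistic.
    decile_gaps: per-decile differences in a discrimination statistic.
    """
    bad_deciles = sum(1 for gap in decile_gaps if abs(gap) > decile_tol)
    if bad_deciles > max_bad_deciles:
        return "re-specify"    # rule 1: discrimination has drifted in many deciles
    if abs(calibration_gap) > calibration_tol:
        return "recalibrate"   # rule 2: only the calibration has drifted
    return "keep"              # no rule fired
```

Ordering the rules so that the more drastic action is checked first ensures that a model with both problems is re-specified rather than merely recalibrated.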

FIG. 6 is a diagram depicting an analytics platform system 600 in accordance with embodiments of the present disclosure. In some embodiments, the creation of an MDF includes receiving data from one or more target analytics data sources 604. The data sources 604 may include, but are in no way limited to, data from customer relationship management 606A, billing 606B, support 606C, transactional 606D, quality 606E, delivery 606F, and more 608. The analytics platform system 600 may employ an extract, transform, and load (ETL) module 612 to incorporate and normalize data from the data sources 604. Among other things, the ETL module 612 may be configured to join data values from the various sources 606A-F, aggregate values, and even aggregate counts across various periods (e.g., time periods, etc.).

In some embodiments, the ETL module 612 may employ specific transformational data structures 616. The transformational data structures 616 can include rules and/or functions that are applied to data received by the ETL module 612. For instance, the transformational data structures 616 can be applied to determine and even prepare the data for storage in the enterprise data warehouse 620. Although some data may require multiple transformations before it is stored, it is anticipated that other data may require minimal or even no transformations before it is directed to an end target or enterprise data warehouse 620. Non-limiting examples of the transformations performed via the transformational data structures 616 of the ETL module 612 may include one or more of joining data from multiple sources (e.g., target analytics data sources 604, etc.), aggregation (e.g., summing multiple data rows, summing values, etc.), encoding values, deriving new values, sorting, and the like.
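
The kinds of transformations listed above (joining, aggregating, encoding, deriving, sorting) can be illustrated with a short Python sketch. The function name transform, the record layouts, and the field names such as customer_id and amount are hypothetical and stand in for whatever the data sources actually provide.

```python
from collections import defaultdict


def transform(crm_rows, billing_rows):
    """Join CRM and billing records per customer and derive summary values."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in billing_rows:
        # Aggregation: sum amounts and count rows for each customer.
        totals[row["customer_id"]] += row["amount"]
        counts[row["customer_id"]] += 1
    warehouse_rows = []
    # Sorting: emit rows in customer_id order.
    for customer in sorted(crm_rows, key=lambda r: r["customer_id"]):
        cid = customer["customer_id"]
        warehouse_rows.append({
            "customer_id": cid,
            "name": customer["name"],                  # joined from the CRM source
            "total_billed": totals[cid],               # aggregated value
            "invoice_count": counts[cid],              # aggregated count
            "is_active": 1 if counts[cid] > 0 else 0,  # derived, encoded as 0|1
        })
    return warehouse_rows
```

Customers with no billing rows still appear in the output with zero totals, which mirrors a left join from the CRM source.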

Once the data has been received, extracted, and in some cases transformed, the data may be loaded into the enterprise data warehouse 620. The enterprise data warehouse 620 may be configured to store data in a historical form, overwrite old data, and/or combinations thereof. As such, the enterprise data warehouse 620 may update stored data periodically, as required. In one example, data extracted from one or more of the data sources 604 can be used to define a particular business person at a company. Although various departments at the company may identify the business person in different manners, and even by different metrics, the data stored in the enterprise data warehouse 620 can be arranged to present all of the identifying characteristics in a uniform manner. Among other things, the uniform presentation manner can allow the business person to be identified by any department by referring to the stored data in the enterprise data warehouse 620.

The MDF module 624 may periodically interact with the ETL module 612 and/or its various components 616, 620. This periodic interaction may range from continual monitoring to specific timed, pseudo-random, or random monitoring time periods. Additionally or alternatively, the interaction by the MDF module 624 may be dynamic in nature, and as such, can be configured to detect and react quickly to data targets and predictors that are contained within the data. In some embodiments, the MDF module 624 may be a part of the analytics platform 628 or share its monitoring functionality with one or more components of the analytics platform 628. In one embodiment, the MDF module 624 may be located separately and apart from the analytics platform 628.

In some embodiments, the analytics platform 628 may include one or more analytics engines 640. The analytics engines 640 may include, but are not limited to, expressions language, simulation and forecasting components, scoring 642A, statistical regression 642B, simulation 642C, predictive streams 642D, third-party models 642E, and more engines 646. It is an aspect of the present disclosure that the analytics platform 628 may communicate with one or more of the target analytics data sources 604, the ETL module 612, the MDF module 624, browser access interfaces 644, third-party online analytical processing (OLAP) modules 648, and/or the various subcomponents that make up each of these components.

It is an aspect of the present disclosure that the analytics platform 628 may be configured to generate and/or refine MDFs used in the analysis of data. The analytics platform 628 may access the data sources 604 and ETL module 612 in modifying, creating, and/or determining an effectiveness of MDFs. In some embodiments, the efficacy of an MDF may be evaluated based on metrics stored in performance metrics tables 632 and business execution key performance indicator (KPI) models 636. In the event that unexpected or outlying values are detected, whether via components of the analytics platform 628 or via third-party OLAP reporting, or in the event that scoring of the MDFs is found to yield low scoring results, the analytics platform 628 may update one or more MDFs to rebuild, retrain, or re-specify the files. As provided herein and above, updating MDFs may be performed dynamically and continually as required to at least yield stationarity in measured and predictive analytics.

In some cases, the system 600 may give an end-user or client the ability to monitor, search, view, and even manage data via a browser access interface 644. The browser access interface 644 may be built upon one or more computer programming language platforms (e.g., Java®, etc.) to provide a web-based front end to data collections via the analytics platform 628.

Referring to FIG. 7, a services architecture 700 and hierarchy is shown in accordance with embodiments of the present disclosure. Among other things, the services architecture 700 can serve as a conceptual framework that characterizes and standardizes the functions of the analytical system as logical layers. The first logical layer is representative of data services 704, which can be associated with one or more data sources 604 as described above. The second logical layer is representative of analytics services 708 and may be configured to store, generate, and even modify MDFs, as disclosed herein. The transport layer 712, acting as the third logical layer in the services architecture 700, may be configured to provide source to destination delivery of messages and/or data. Additionally or alternatively, the transport layer 712 may be responsible for flow and error control of data in the system. Typical transports may include hypertext transfer protocol (HTTP) or secure HTTP (HTTPS) for web-based applications. In some cases, the transports may incorporate secure shell (SSH) cryptographic network protocol to integrate event handlers.

In some embodiments, the applications layer of the architecture 700 may include one or more application programming interface(s) (API) 716, adaptive applications 720, and the like. The API 716 can be configured for a specific computer programming language, such as Oracle's® Java® programming language. Among other things, the API 716 can provide for session and native data semantics. The adaptive applications 720 may include those applications that are embedded into a business environment. In some cases, the applications may be embedded via one or more of enhanced expressions analytics language, data wrapped into a signal message, and integration via at least one computer programming language adapter. It is an aspect of the present disclosure that other applications (e.g., parent applications, etc.) may stream data to the adaptive application 720.

In one embodiment, the service requestor 728 may act to initiate and/or respond to event requests 736. As shown in FIG. 7, the request response 732 flows from the service requestor 728 through the analytics services 708. Event requests 736 can be initiated via the service requestor 728 through analytics services 708 and to an adaptive application 720. In some embodiments, the adaptive application 720 may submit an event request 736 to the service requestor 728 directly. The services architecture 700 may be distributed 740 for further monitoring and management.

FIG. 8 is a block diagram depicting an exemplary data structure 800 row used in accordance with embodiments of the present disclosure. Specifically, the data structure 800 is shown as a single row which represents combined data for an entity. In this example, the entity may be a customer of a particular service or business. As such, the corresponding data structure 800 may include a customer date field 804, first cumulative count field 808, second cumulative count field 812, total purchases field 816, purchases in a 30-day period field, purchases in a 60-day period field, days since last purchase field, and more fields 828 representing values that may be used in analytics. Although the data structure represents data associated with an entity in a single row, it should be appreciated that other rows 836 may be created to represent other entities.

In the example above, the customer date field 804 may comprise data that identifies the date the entity became a customer of the service or business. This field 804 may be used by the analytics platform 628 in creating and/or refining an MDF. In some cases, the analytics platform 628 may utilize the data contained in the field 804 to populate other fields of the data structure 800. Additionally, or alternatively, the customer date field 804 may include data used to order the entity and row among a series of rows in analytics processing.

The first and second cumulative count fields 808, 812 may comprise data that has been summed and counted within a particular context. In some cases, as data is extracted and transformed, it may be stored in one or more of the data fields in the data structure 800. The MDF can define the creation of the data fields disclosed herein. Additionally or alternatively, the MDF may determine that joined sets of data can be represented as a single data field in the data structure 800. For example, the cumulative count fields 808, 812 can include data that changes over time or is modified as a result of a cumulative change to source data. The cumulative count fields 808, 812 may include data combined from other data fields of the data structure 800.

The total purchases data field 816 may comprise data that is taken directly from one or more data sources 604 or data that has been transformed via the ETL module 612. In the present example, the field 816 may represent the total number of purchases the entity has made since the entity became a customer. In some cases, the MDF may determine that valuable analytics information can be associated with purchases made by an entity over certain time periods. For instance, the MDF may determine to add a 30-day purchases data field 820, a 60-day purchases data field 824, and even more depending on the specific results desired. Using this data, an anticipated timeline of customer behavior may be mapped for one or more entities. Among other things, this mapped behavior can be used to predict trends, determine worth, and enhance the focus of strategic advertising.

The days since last purchase data field 828 may comprise data related to an entity's purchase behavior. Specifically, the field 828 may be used to track the time that has passed since the entity made a purchase. Similar to the other data fields of the data structure 800, this data may be extracted from one or more data sources 604, generated via a transformation of extracted data, or created via implementation of an MDF pulling information from one or more of the data fields associated with the data structure 800. The data field 828 may provide useful tracking data based on recent purchases made. In one implementation, the analytics platform 628 may include or exclude an entity in various calculations based on recency, or relevant time assessment. Additionally or alternatively, one or more values and/or data fields associated with an entity may be included or excluded in modifying, refining, and/or generating an MDF. In the present example, recency and relevancy may be determined by referring to the days since last purchase field 828 of the data structure 800.
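
A single row of the data structure 800 could be represented as in the Python sketch below. The class name EntityRow, the helper functions, the attribute names, and the 90-day recency cutoff are assumptions for illustration; the cumulative count fields are omitted because their contents are context-dependent.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class EntityRow:
    """Hypothetical single-row record holding combined data for one entity."""
    customer_date: date            # date the entity became a customer
    total_purchases: int           # purchases counted up to the as-of date
    purchases_30_day: int          # purchases in the trailing 30 days
    purchases_60_day: int          # purchases in the trailing 60 days
    days_since_last_purchase: int  # recency measure


def build_row(customer_date, purchase_dates, as_of):
    """Derive the windowed and recency fields from raw purchase dates."""
    past = [d for d in purchase_dates if d <= as_of]
    if not past:
        raise ValueError("entity has no purchases on or before the as-of date")

    def within(days):
        return sum(1 for d in past if as_of - d <= timedelta(days=days))

    return EntityRow(
        customer_date=customer_date,
        total_purchases=len(past),
        purchases_30_day=within(30),
        purchases_60_day=within(60),
        days_since_last_purchase=(as_of - max(past)).days,
    )


def is_recent(row, max_days=90):
    """Include an entity in a calculation only if its last purchase is recent."""
    return row.days_since_last_purchase <= max_days
```

Filtering rows with is_recent before a calculation is one simple way to include or exclude entities on the basis of recency.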

In the foregoing description, for the purposes of illustration, methods were described in a particular order. It should be appreciated that in alternate embodiments, the methods and steps thereof may be performed in a different order than that described. It should also be appreciated that the methods described above may be performed by hardware components or may be embodied in sequences of machine-executable instructions, which may be used to cause a machine, such as a general-purpose or special-purpose processor or logic circuits programmed with the instructions, to perform the methods. These machine-executable instructions may be stored on one or more machine readable mediums, such as CD-ROMs or other type of optical disks, floppy diskettes, ROMs, RAMs, EPROMs, EEPROMs, SIMs, SAMs, magnetic or optical cards, flash memory, or other types of machine-readable mediums suitable for storing electronic instructions. Alternatively, the methods may be performed by a combination of hardware and software.

Specific details were given in the description to provide a thorough understanding of the embodiments. However, it will be understood by one of ordinary skill in the art that the embodiments may be practiced without these specific details. For example, circuits may be shown in block diagrams in order not to obscure the embodiments in unnecessary detail. In other instances, well-known circuits, processes, algorithms, structures, and techniques may be shown without unnecessary detail in order to avoid obscuring the embodiments.

Also, it is noted that the embodiments were described as a process which is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process is terminated when its operations are completed, but could have additional steps not included in the figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination corresponds to a return of the function to the calling function or the main function.

Furthermore, embodiments may be implemented by hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the necessary tasks may be stored in a machine readable medium such as a storage medium. A processor(s) may perform the necessary tasks. A code segment may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a class, or any combination of instructions, data structures, or program statements. A code segment may be coupled to another code segment or a hardware circuit by passing and/or receiving information, data, arguments, parameters, or memory contents. Information, arguments, parameters, data, etc. may be passed, forwarded, or transmitted via any suitable means including memory sharing, message passing, token passing, network transmission, etc.

While illustrative embodiments of the disclosure have been described in detail herein, it is to be understood that the inventive concepts may be otherwise variously embodied and employed, and that the appended claims are intended to be construed to include such variations, except as limited by the prior art.

What is claimed is:
1. A method, comprising: building a model; using the model in a first expert system; obtaining at least some ongoing operational data from the first expert system while the first expert system uses the model; and repeatedly testing the model against actual data to determine whether or not to recalibrate or re-specify the model.
2. The method of claim 1, wherein a second expert system performs the testing of the model.
3. The method of claim 2, wherein the second expert system obtains random samples of data from the first expert system to test the model.
4. The method of claim 1, further comprising: obtaining an independent source of random data; and using the independent source of random data as part of the testing.
5. The method of claim 1, wherein the model is tested by tracking a goodness-of-fit (GOF).
6. The method of claim 5, further comprising: gathering one or more statistics relevant to calibration for the actual data; gathering one or more statistics relevant to discrimination for the actual data; comparing the one or more statistics relevant to calibration to a first pre-existing validation data; comparing the one or more statistics relevant to discrimination to a second pre-existing validation data; and based on the comparisons, determining the GOF for the model.
7. A system, comprising: a processor; and a computer readable medium coupled to the processor and comprising instructions that, when executed by the processor, enable the processor to perform the following: build a model; use the model in a first expert system; obtain at least some ongoing operational data from the first expert system while the first expert system uses the model; and repeatedly test the model against actual data to determine whether or not to recalibrate or re-specify the model.
8. The system of claim 7, wherein a second expert system performs the testing of the model.
9. The system of claim 8, wherein the second expert system obtains random samples of data from the first expert system to test the model.
10. The system of claim 7, wherein the instructions further enable the processor perform the following: obtain an independent source of random data; and use the independent source of random data as part of the testing.
11. The system of claim 7, wherein the model is tested by tracking a goodness-of-fit (GOF).
12. The method of claim 11, wherein the instructions further enable the processor perform the following: gather one or more statistics relevant to calibration for the actual data; gather one or more statistics relevant to discrimination for the actual data; compare the one or more statistics relevant to calibration to a first pre-existing validation data; compare the one or more statistics relevant to discrimination to a second pre-existing validation data; and based on the comparisons, determine the GOF for the model.
13. A computer learning system, comprising: one or more servers capable of building a model, using the model in a first expert system, obtaining at least some ongoing operational data from the first expert system while the first expert system uses the model, and repeatedly testing the model against actual data to determine whether or not to recalibrate or re-specify the model.
14. The computer learning system of claim 13, wherein a second expert system performs the testing of the model.
15. The computer learning system of claim 14, wherein the second expert system obtains random samples of data from the first expert system to test the model.
16. The computer learning system of claim 13, wherein the one or more servers are further capable of obtaining an independent source of random data and using the independent source of random data as part of the testing.
17. The method of claim 13, wherein the model is tested by tracking a goodness-of-fit (GOF).
18. The method of claim 17, wherein the one or more servers are further capable of gathering one or more statistics relevant to calibration for the actual data, gathering one or more statistics relevant to discrimination for the actual data, comparing the one or more statistics relevant to calibration to a first pre-existing validation data, comparing the one or more statistics relevant to discrimination to a second pre-existing validation data, and, based on the comparisons, determining the GOF for the model.